NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Rates of Convergence in the Central Limit Theorem for Markov Chains, with an Application to TD Learning

https://doi.org/10.1287/moor.2024.0444

Srikant, R (October 2025, Mathematics of Operations Research)

We prove a nonasymptotic central limit theorem (CLT) for vector-valued martingale differences using Stein’s method, and we use Poisson’s equation to extend the result to functions of Markov chains. We then show that these results can be applied to establish a nonasymptotic CLT for temporal difference learning with averaging. Funding: This work was supported by National Science Foundation [Grants CNS 23-12714, CCF 22-07547, and CNS 21-06801] and Air Force Office of Scientific Research [Grant FA9550-24-1-0002].
more » « less
Full Text Available
Optimal Hybrid Feedback-Driven Learning for Wireless Interactive Panoramic Scene Delivery

https://doi.org/10.1145/3704413.3764474

Wu, Xiaoyi; Steiger, Juaren; Li, Bin; Srikant, R (October 2025, ACM)

Full Text Available
Joint Optimal Transport and Embedding for Network Alignment

Qi, Yu; Zeng, Zhichen; Yan, Yuchen; Ying, Lei; Srikant, R; Tong, Hanghang (April 2025, WWW 2025)

Full Text Available
Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization

Du, Y; Winnicki, A; Dalal, G; Mannor, S; Srikant, R (July 2024, International Conference on Machine Learning)

Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical successes while relying on a small amount of human feedback. However, there is limited theoretical justification for this phenomenon. Additionally, most recent studies focus on value-based algorithms despite the recent empirical successes of policy-based algorithms. In this work, we consider an RLHF algorithm based on policy optimization (PO-RLHF). The algorithm is based on the popular Policy Cover-Policy Gradient (PC-PG) algorithm, which assumes knowledge of the reward function. In PO-RLHF, knowledge of the reward function is not assumed and the algorithm relies on trajectory-based comparison feedback to infer the reward function. We provide performance bounds for PO-RLHF with low query complexity, which provides insight into why a small amount of human feedback may be sufficient to get good performance with RLHF. A key novelty is our trajectory-level elliptical potential analysis technique used to infer reward function parameters when comparison queries rather than reward observations are used. We provide and analyze algorithms in two settings: linear and neural function approximation, PG-RLHF and NN-PG-RLHF, respectively.
more » « less
Full Text Available
CASCADING REINFORCEMENT LEARNING

Du, Y; Srikant, R; Chen, W (May 2024, ICLR)

Cascading bandits have gained popularity in recent years due to their applicability to recommendation systems and online advertising. In the cascading bandit model, at each timestep, an agent recommends an ordered subset of items (called an item list) from a pool of items, each associated with an unknown attraction probability. Then, the user examines the list, and clicks the first attractive item (if any), and after that, the agent receives a reward. The goal of the agent is to maximize the expected cumulative reward. However, the prior literature on cascading bandits ignores the influences of user states (e.g., historical behaviors) on recommendations and the change of states as the session proceeds. Motivated by this fact, we propose a generalized cascading RL framework, which considers the impact of user states and state transition into decisions. In cascading RL, we need to select items not only with large attraction probabilities but also leading to good successor states. This imposes a huge computational challenge due to the combinatorial action space. To tackle this challenge, we delve into the properties of value functions, and design an oracle BestPerm to efficiently find the optimal item list. Equipped with BestPerm, we develop two algorithms CascadingVI and CascadingBPI, which are both computation-efficient and sample-efficient, and provide near-optimal regret and sample complexity guarantees. Furthermore, we present experiments to show the improved computational and sample efficiencies of our algorithms compared to straightforward adaptations of existing RL algorithms in practice.
more » « less
Full Text Available
The Role of Lookahead and Approximate Policy Evaluation in Reinforcement Learning with Linear Value Function Approximation

https://doi.org/10.1287/opre.2022.0357

Winnicki, Anna; Lubars, Joseph; Livesay, Michael; Srikant, R (May 2024, Operations Research)

In many applications of reinforcement learning, the underlying system dynamics are known, but computing the optimal policy is still difficult because the size of the state space can be enormously large. For example, Shannon famously estimated the number of states in chess to be approximately 10^120. To handle such enormous state space sizes, policy iteration algorithms make two major approximations; it is assumed that the value function lies in a lower-dimensional space and that only a few steps of a trajectory (called rollout) are used to evaluate a policy. Using a counterexample, we show that approximations can lead to the divergence of policy iteration. We show that sufficient lookahead in the policy improvement step mitigates this divergence and leads to algorithms with bounded errors. We also show that these errors can be controlled by appropriately choosing an appropriate amount of lookahead and rollout.
more » « less
Full Text Available
Modified Policy Iteration for Exponential Cost Risk Sensitive MDPs

Murthy, Y.; Moharrami, M.; Srikant, R. (June 2023, Learning for Dynamics and Control)

Full Text Available
On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation

Winnicki, A.; Srikant, R. (April 2023, AISTATS)

Full Text Available
Joint Optimal Transport and Embedding for Network Alignment

https://doi.org/10.1145/3696410.3714937

Yu, Qi; Zeng, Zhichen; Yan, Yuchen; Ying, Lei; Srikant, R; Tong, Hanghang (April 2025, ACM)

Full Text Available
Learning While Scheduling in Multi-Server SystemsWith Unknown Statistics: MaxWeight with Discounted UCB

yang, Z.; Srikant, R.; Ying, L. (April 2023, AISTATS)

Full Text Available

« Prev Next »

Search for: All records